Abstract

Only about 1,000 qualitatively different protein folds are believed to exist 
in nature. Here, we review theoretical studies which suggest that some folds 
are intrinsically more designable than others, i.e. are lowest energy states 
of an unusually large number of sequences. The sequences associated with 
these folds are also found to be unusually thermally stable. The connection 
between highly designable structures and highly stable sequences is generally 
known as the "designability principle". The designability principle may help 
explain the small number of natural folds, and may also guide the design of 
new folds. 

I. INTRODUCTION

Two remarkable features of natural proteins are the simple fact that they fold and the 
limited number of distinct folds they adopt. Random amino-acid sequences typically do 
not fold to a unique structure. Rather they have many competing configurations of similar
minimum free energy. Nature has evolved sequences that do fold stably, but it is estimated 
that the total number of qualitatively different folds is only about 1,000 [1-3]. 

To attempt to explain these remarkable features of natural proteins, we have proposed 
a principle of designability [4-8]. The designability of a structure is the number of
sequences which have that structure as their unique lowest-energy configuration [8]. In a wide
range of models, structures are found to differ dramatically in designability, and sequences 
associated with highly designable structures have unusually high thermal stability [8-12]. 
We refer to this connection between the designability of a structure and the stability of 
its associated sequences as the designability principle. In model studies, highly designable 
structures are rare. As a result, thermally stable sequences are also rare. We conjecture 
that the designability principle also applies to real proteins, and that natural protein folds 
are exceptional, highly designable structures. 

In this article, we review the designability principle. We start from a minimal model of 
protein structure in which the designability of a structure can be understood geometrically 
as the size of its basin of attraction in sequence space. More detailed models, including 
all 20 amino-acid types and off-lattice backbone configurations, reinforce the basic principle 
and provide a framework for the design of qualitatively new protein folds. 

II. PURELY HYDROPHOBIC (PH) MODEL 

Generally, the folding of proteins relies on the formation of a hydrophobic core of amino 
acids. Consideration of hydrophobicity alone leads to a very simple description of proteins:
the "purely hydrophobic" (PH) model [9]. In this model, sequences consist of only two
types of amino acids, hydrophobic and polar [13]. Structures are compact walks on a cubic
or square lattice. An example of a 6 x 6 square structure is shown in Fig. 1(a). As
indicated, each structure consists of core sites surrounded by surface sites. The energy of a 
particular sequence folded into a particular structure is the number of hydrophobic amino 
acids occupying core sites, multiplied by -1,

    E = -\sum_{i=1}^{N} s_i h_i.    (1)

A binary string {s_i} represents each folded structure: s_i = 1 if site i along the chain is in
the core, and s_i = 0 if the site is on the surface. Similarly, a binary string {h_i} represents
each sequence: h_i = 1 if the ith amino acid in the sequence is hydrophobic, and h_i = 0 if
the amino acid is polar.

Within the PH model, structures differ dramatically in their designability. In practice, 
the designability is obtained by sampling a large number of binary sequences, and, for each 
sequence, recording the unique lowest-energy structure if there is one. Finally, the number 
of sequences which map to, i.e. "design", each structure is summed to give the designability 
N_S of the structure. Fig. 1(b) shows a histogram of designability N_S for compact 6x6
structures. There are 57,337 structures, with 30,408 distinct binary strings. Most structures 
have a designability around 50, but a small number of structures have designabilities more 
than 10 times this high. If sequences were randomly assigned to structures, the result would
be the Poisson distribution shown for comparison, and there would be no structures
with such high designability. 
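
The sampling procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the studies reviewed here: the structure strings in TOY_STRUCTURES are hypothetical six-site patterns chosen by hand rather than strings enumerated from compact lattice walks, and the function names are ours. Energies follow Eq. (1), and a sequence contributes to a structure's designability only if that structure is its unique lowest-energy state.

```python
import random
from collections import Counter

def energy(structure, sequence):
    """Eq. (1): minus the number of hydrophobic residues (h=1) on core sites (s=1)."""
    return -sum(s * h for s, h in zip(structure, sequence))

def unique_ground_state(structures, sequence):
    """Index of the unique lowest-energy structure, or None if the minimum is tied."""
    energies = [energy(s, sequence) for s in structures]
    lowest = min(energies)
    winners = [k for k, e in enumerate(energies) if e == lowest]
    return winners[0] if len(winners) == 1 else None

def sample_designability(structures, n_samples, seed=0):
    """Estimate designability by folding n_samples random binary sequences."""
    rng = random.Random(seed)
    n_sites = len(structures[0])
    counts = Counter()
    for _ in range(n_samples):
        sequence = tuple(rng.randint(0, 1) for _ in range(n_sites))
        k = unique_ground_state(structures, sequence)
        if k is not None:
            counts[k] += 1
    return [counts[k] for k in range(len(structures))]

# Hypothetical toy structure strings (assumption: hand-picked, two core sites each).
TOY_STRUCTURES = [
    (0, 1, 1, 0, 0, 0),
    (0, 1, 0, 1, 0, 0),
    (0, 0, 1, 1, 0, 0),
    (1, 0, 0, 0, 0, 1),  # unusual pattern, far from the others
    (0, 0, 1, 1, 0, 0),  # same string as the third entry: always tied
]

if __name__ == "__main__":
    print(sample_designability(TOY_STRUCTURES, 5000, seed=1))
```

In this toy set, the two structures that share a string can never be a unique ground state, while the structure with the unusual pattern collects the most sequences, which mirrors the qualitative behavior described for the 6x6 model.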

Importantly, the PH model has a simple geometrical representation that explains both 
the wide range of designabilities and the close connection between designability and thermal 
stability. To find the relative energies of different structures for a given sequence, the energy 
in Eq. (1) can be replaced by

    E = \sum_{i=1}^{N} (h_i - s_i)^2.    (2)

This replacement is allowed because the extra term \sum_i h_i^2 is a constant for a given sequence,
and the other extra term \sum_i s_i^2 is also a constant, equal to the number of core sites, for all
compact structures. Eq. (2) indicates that the energy of a sequence folded into a particular 
structure is simply the Hamming distance [14] between their respective binary strings. So, 
for a given sequence, the lowest-energy structure is simply the closest structure. The
designability of a structure is therefore the exclusive volume of binary strings (sequences) that
lie closer to it than to any other structure, as shown schematically in Fig. 2. 
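
The equivalence between the overlap energy of Eq. (1) and a Hamming distance can be checked by expanding the square, using only the binary definitions of s_i and h_i given above. The label E_1 for the energy of Eq. (1) is introduced here for convenience.

```latex
\sum_{i=1}^{N} (h_i - s_i)^2
  = \sum_{i=1}^{N} h_i^2 + \sum_{i=1}^{N} s_i^2 - 2 \sum_{i=1}^{N} s_i h_i
  = \underbrace{\sum_{i=1}^{N} h_i^2}_{\text{fixed by the sequence}}
  + \underbrace{\sum_{i=1}^{N} s_i^2}_{\text{number of core sites}}
  + 2 E_1,
\qquad E_1 = -\sum_{i=1}^{N} s_i h_i .
```

Since h_i and s_i take only the values 0 and 1, each term (h_i - s_i)^2 equals 1 when the two strings differ at site i and 0 otherwise, so the sum counts mismatched sites, i.e. the Hamming distance. The constant offsets and the positive factor of 2 do not change which structure has the lowest energy for a given sequence.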

The wide range of structure designabilities can be traced to the nonuniform density of 
structures in the space of binary strings. Most structures are found in dense regions, i.e. 
in clusters of structures with similar patterns of surface and core sites. Structures found 
in these crowded regions have small exclusive volumes, and so, by definition, have small 
designabilities. In fact, many groups of distinct structures share an identical surface-core 
pattern (binary string) and therefore have zero designability. In contrast, a few structures 
fall in low-density regions, that is they have unusual surface-core patterns, and so have large 
exclusive volumes. These are the highly designable structures. In Fig. 3(a), we plot the 
number of structures n(d) at a Hamming distance d from a structure with low, intermediate, 
and high designability, respectively. It shows that both low- and high-density neighborhoods 
typically have a large spatial extent, reaching nearly halfway across the space of binary 
strings. 

The geometrical representation of the PH model makes clear the connection between 
thermal stability and designability. A sequence is considered to be thermally unstable if 
it has a small or vanishing energy gap between its lowest-energy structure and all other 
structures, and if there are many such competing structures. In the PH model, the energy 
of a sequence folded into a particular structure is the distance between their binary strings. 
A sequence which folds to a structure in a dense region (cf. Fig. 2) will necessarily lie close to 
many other structures, and will therefore have many competing low-energy conformations. 
Even if the sequence perfectly matches the structure, with hydrophobic amino acids at all 
core sites and polar amino acids at all surface sites, the high surrounding density of structures 
with similar surface-core patterns implies a large number of competing folds. This is the 
hallmark of thermal instability. Therefore, the low-designability structures, which are found 
in high-density regions, will have associated sequences which are thermally unstable. 

In contrast, if a sequence folds to a structure in a low-density region, there will be 
relatively few nearby structures, and so relatively few competing folds. These sequences will 
be thermally stable. Therefore, the highly designable structures, from low-density regions,
will have associated sequences of high thermal stability. This is the designability principle
in a nutshell: high designability and thermal stability are connected because both arise from
low-density regions in the space of binary strings which represent folded structures. 

A measure of the "neighborhood density" of structures around a particular structure is 
the variance γ of the quantity n(d) shown in Fig. 3(a). The variance γ is directly related to
the thermal stability: smaller γ implies lower neighborhood density and hence higher thermal
stability. In Fig. 3(b) we plot this variance as a function of designability. It shows a strong 
correlation between the designability and the thermal stability. 
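
As a rough illustration of how n(d) and a variance-type density measure could be computed from a list of structure strings, consider the sketch below. It rests on an assumption that the text does not spell out: the variance is taken to be the variance of the distance d when n(d) is normalized to a distribution over d. The function names are hypothetical.

```python
def hamming(a, b):
    """Number of sites at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def neighbor_histogram(structures, k):
    """n(d): number of other structures at Hamming distance d from structure k."""
    n_sites = len(structures[k])
    n = [0] * (n_sites + 1)
    for j, other in enumerate(structures):
        if j != k:
            n[hamming(structures[k], other)] += 1
    return n

def distance_variance(n):
    """Variance of d under the normalized distribution n(d) / sum_d n(d).

    Assumption: this is one plausible reading of "the variance of n(d)".
    """
    total = sum(n)
    mean = sum(d * count for d, count in enumerate(n)) / total
    return sum(count * (d - mean) ** 2 for d, count in enumerate(n)) / total

if __name__ == "__main__":
    toy = [(0, 0), (0, 1), (1, 1)]  # hypothetical two-site strings
    hist = neighbor_histogram(toy, 0)
    print(hist, distance_variance(hist))
```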

Since low-energy competing structures also represent kinetic traps, one expects the thermally
stable sequences associated with highly designable structures to be fast folders as well.
This has been tested for a lattice model closely related to the PH model [15]. 

III. MIYAZAWA-JERNIGAN (MJ) MATRIX MODEL 

Natural proteins contain 20 amino acids, not two, and their interactions are more complicated
than simple hydrophobic solvation. Some of these real-world features are captured
in the Miyazawa-Jernigan (MJ) matrix model. The MJ matrix is a set of amino-acid interaction
energies inferred from the propensities of different types of amino acids to be neighbors
in natural folded structures [16]. The model assigns the appropriate energy from the MJ
matrix to every pair of amino acids that are on neighboring lattice sites, but are not adjacent
(covalently bonded) on the chain, as indicated in Fig. 4(a) [11]. In studies using the
MJ-matrix model, there are generally too many possible sequences (20^N) to enumerate, but
the relative designabilities of structures can be obtained accurately by random sampling of 
sequences. 
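
The contact-energy rule just described (sum a pair energy over residues that sit on neighboring lattice sites but are not bonded along the chain) can be sketched as follows for a square lattice. The interaction table TOY_PAIR_ENERGY is a hypothetical two-letter stand-in, not the MJ matrix, and the function name is ours.

```python
def contact_energy(walk, sequence, pair_energy):
    """Sum pair energies over noncovalent nearest-neighbor contacts.

    walk        -- list of (x, y) lattice sites visited in chain order
    sequence    -- residue types, one per site of the walk
    pair_energy -- dict keyed by a sorted tuple of two residue types
    """
    index_at = {site: i for i, site in enumerate(walk)}
    total = 0.0
    for i, (x, y) in enumerate(walk):
        # Looking only right and up visits each lattice bond exactly once.
        for neighbor in ((x + 1, y), (x, y + 1)):
            j = index_at.get(neighbor)
            if j is not None and abs(i - j) > 1:  # skip covalently bonded pairs
                key = tuple(sorted((sequence[i], sequence[j])))
                total += pair_energy[key]
    return total

# Hypothetical toy energies (assumption: illustrative values only).
TOY_PAIR_ENERGY = {("H", "H"): -2.0, ("H", "P"): -1.0, ("P", "P"): 0.0}

# A compact 3x3 "snake" walk used as an example structure.
SNAKE_3X3 = [(0, 0), (1, 0), (2, 0), (2, 1), (1, 1), (0, 1), (0, 2), (1, 2), (2, 2)]

if __name__ == "__main__":
    print(contact_energy(SNAKE_3X3, "HHHHHHHHH", TOY_PAIR_ENERGY))
```

Replacing the toy table with a 20 x 20 table of published interaction energies would give the model energy for a 20-letter sequence; only the dictionary changes.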

Fig. 5(a) shows a histogram of designability for compact 6x6 structures obtained using 
the MJ matrix of interaction energies. The form of the distribution is very similar to the 
PH-model histogram, including the tail of highly designable structures (Fig. 1(b)). There 
is also a strong correlation between thermal stability and designability in the MJ-matrix
model [11]. For thermal stability, one can use some measure of "neighborhood" density of 
states. We find that in the MJ-matrix model, the thermal stability of a sequence folded 
into a structure is well correlated with the local energy gap [17] between the lowest-energy 
structure and the next lowest. Fig. 5(b) shows the energy gap averaged over sequences 
which fold to structures of a given designability N_S. With increasing designability, there is a
clear increase in the average gap, and hence in the thermal stability of associated sequences. 
The results of the MJ-matrix model are very similar to those obtained with the PH model. 
Indeed, the same structures are found to be highly designable in both models, including 
the same top structure shown in Fig. 4(a). The most designable 3x3x3 structure is 
shown in Fig. 4(b). Qualitatively, the results of the MJ-matrix model are the same for 
three-dimensional structures (Fig. 6) as for two-dimensional ones. 
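
The gap statistic discussed above can be made concrete with a short sketch. It assumes that, for each sampled sequence with a unique ground state, one has already recorded which structure it folds to and the gap to the next-lowest structure; the record layout and function names are hypothetical.

```python
from collections import defaultdict

def energy_gap(energies):
    """Gap between the lowest and next-lowest energies over all structures.

    A gap of 0.0 signals a degenerate ground state.
    """
    lowest, second = sorted(energies)[:2]
    return second - lowest

def average_gap_by_designability(ground_states):
    """Average gap as a function of designability.

    ground_states -- iterable of (structure_id, gap), one entry per sampled
                     sequence that has a unique ground state.
    Returns {designability: mean gap over sequences of such structures}.
    """
    gaps_per_structure = defaultdict(list)
    for structure_id, gap in ground_states:
        gaps_per_structure[structure_id].append(gap)
    gaps_per_designability = defaultdict(list)
    for gaps in gaps_per_structure.values():
        # The designability estimate is the number of sequences recorded.
        gaps_per_designability[len(gaps)].extend(gaps)
    return {n: sum(g) / len(g) for n, g in gaps_per_designability.items()}

if __name__ == "__main__":
    records = [("a", 1.0), ("a", 3.0), ("b", 0.5)]  # hypothetical data
    print(average_gap_by_designability(records))
```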

Why are the results of the purely hydrophobic model and the Miyazawa-Jernigan-matrix 
model so similar? In fact, both models are dominated by hydrophobic solvation energies. 
The interaction energy between any two amino acids i and j in the MJ matrix can be well
approximated by -(h_i + h_j), where h_i is an effective hydrophobicity for each amino acid [18].
This implies that the energy of formation of a non-covalent nearest-neighbor pair is simply 
the desolvation energy of shielding one face of each amino acid from the surrounding water. 
As a result, the MJ-matrix model can be viewed as a variant of the PH model in which 
there are 20 possible values of hydrophobicity instead of just two. An additional distinction 
is that the MJ-matrix model has a range of different site types (core, edge, and corner in 
two dimensions; core, edge, face, and corner in three dimensions) rather than just surface 
and core as in the PH model. Overall, these differences are not enough to alter the basic 
designability principle, or even to change the set of highly designable 6x6 structures. 
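
The reduction of the additive pair approximation to a PH-like form can be written out explicitly. The notation below is introduced here for illustration: a_k is the amino-acid type at chain position k, the sum over <k,l> runs over noncovalent nearest-neighbor pairs of positions, and c_k is the number of such contacts made by position k in a given structure.

```latex
E \approx -\sum_{\langle k,l \rangle} \left( h_{a_k} + h_{a_l} \right)
  = -\sum_{k=1}^{N} c_k \, h_{a_k} .
```

Each h_{a_k} appears once for every contact that position k makes, which gives the second form. For a compact structure on the square lattice, a position that is not a chain end has c_k = 2, 1, or 0 when it sits on a core, edge, or corner site, respectively, since two of its occupied neighbors are covalently bonded to it. The contact numbers c_k thus play the role of the binary s_i in Eq. (1), with more than two site types.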

For the MJ-matrix model, one can still construct a space of structure strings, now including
several levels of solvent exposure between 0 (most exposed) and 1 (most buried).
As in the PH model, some regions of this space are dense with structures and some have 
few structures. Structures with similar surface-exposure strings compete for sequences. As 
a result, structures in high-density regions have small basins of attraction for sequences,
and structures in low-density regions have large basins. Moreover, sequences associated 
with structures in low-density regions have few competing conformations and are the most 
thermally stable. Therefore, the designability principle holds in the MJ-matrix model for 
the same reason it does in the PH model: high designability and high thermal stability are 
connected because both arise in low-density regions in the space of strings, i.e. the space of 
surface-exposure patterns of structures. 

Lattice-protein models in which hydrophobic solvation does not dominate may show 
different behavior. For example, Buchler and Goldstein reported results of a variant of the 
MJ-matrix model in which the dominant hydrophobic term -(h_i + h_j) had been subtracted
out [19]. They found a set of highly designable structures different from that obtained with 
the full MJ matrix, and similar to the set obtained for a random pairing potential between 
amino acids. 

IV. OFF-LATTICE MODELS 

Natural proteins fold in three dimensions, and their main degrees of freedom are bond 
rotations. Does the designability principle extend to off-lattice models with more realistic 
degrees of freedom? One model for which designability has been studied off-lattice is a 
3-state discrete-angle model, of the type introduced by Park and Levitt [20]. The results 
strongly confirm the designability principle, and suggest the possibility of creating new, 
highly designable folds in the laboratory [10]. 

The main degrees of freedom of a protein backbone are the dihedral angles φ and ψ.
Certain pairs of φ-ψ angles are preferred in natural structures, since they lead to conserved
secondary structures such as α-helices and β-strands [21]. Discrete-angle models for protein
structure take advantage of these preferences by allowing only certain combinations of angles.
For an m-pair model, the total number of backbone structures grows as m^N. With m = 3, it
is possible to computationally enumerate all structures up to roughly N = 30 amino acids.
Figure 7(a) shows an example of a protein backbone of length N = 23 generated using a
3-state model with (φ, ψ) = (-95,135), (-75,-25), and (-55,-55), where the first pair of angles
corresponds to a β-strand and the second two correspond to variants of α-helices. Structures
are decorated with spheres representing sidechains, as shown in Fig. 7(b). Only compact 
self-avoiding structures are considered as possible protein folds. 
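
To give a feel for the m^N growth, the sketch below enumerates angle-pair assignments for a short chain using the three (φ, ψ) pairs quoted above. It only generates the strings of angle pairs: building three-dimensional coordinates and filtering for compact, self-avoiding structures are not implemented, and the function names are hypothetical. Whether a chain of N residues carries exactly N free angle pairs is treated here as an assumption.

```python
from itertools import product

# The three (phi, psi) pairs quoted in the text, in degrees.
ANGLE_STATES = [(-95, 135), (-75, -25), (-55, -55)]

def backbone_angle_strings(n_pairs, states=ANGLE_STATES):
    """Lazily yield every assignment of one angle pair per chain position."""
    return product(states, repeat=n_pairs)

def count_strings(n_pairs, m=len(ANGLE_STATES)):
    """Number of assignments before any geometric filtering: m ** n_pairs."""
    return m ** n_pairs

if __name__ == "__main__":
    for n in (5, 23, 30):
        print(n, count_strings(n))
    first = next(iter(backbone_angle_strings(4)))
    print(first)
```

For 30 positions the count is of order 10^14, which is consistent with the statement that exhaustive enumeration becomes impractical much beyond that length.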

To assess designability of these off-lattice structures, the solvent-exposed area of each 
sidechain sphere is evaluated. An energy of hydrophobic solvation is defined as in Eq. (1) 
by E = -\sum_{i=1}^{N} s_i h_i, where now s_i is the fractional exposure to solvent of the ith sidechain,
and the h_i are amino-acid hydrophobicities. Figure 8(a) shows a histogram of designability
for the 3-state model. There is a wide range of designabilities, with a tail of very highly 
designable structures. A strong correlation exists between the designability of a structure 
and the thermal stability of its associated sequences, as shown in Fig. 8(b). 

The designability principle evidently applies to the 3-state model as well as to the lattice 
models discussed earlier. This is not surprising, because folding in the 3-state model is also 
driven by hydrophobic solvation. Each structure in the model is represented by a string 
of sidechain solvent exposures, represented by real numbers between 0 and 1. Again, the
space of these strings has high- and low-density regions, with the by now familiar relation
between low density and high designability and thermal stability leading to the designability 
principle. 

A major advantage of the 3-state model is that it addresses structures that a real polypeptide
chain can adopt. Among the highly designable folds, one recovers several recognizable
natural structures, including an α-turn-α fold and a zincless zinc-finger. In addition, some
of the highly designable folds, including a β-α-β structure, have not been observed in nature
as independent domains. Results of our effort to create this fold in the laboratory are
encouraging [22]. 

V. DISCUSSION AND CONCLUSION



The designability principle has been explored in a number of models for proteins, including
all 20 amino acids and realistic backbone conformations. In these models, the strong
link between designability and thermal stability can be traced to the dominance of the hydrophobic
solvation energy. Whenever hydrophobicity is dominant, each structure can be
reduced to its pattern of solvent exposure. In the same vein, each sequence can be reduced
to its pattern of hydrophobicity. Sequences will fold so as to best match their hydrophobic
residues to the buried sites of structures. Both designability (number of sequences per structure)
and thermal stability depend on a competition among structures with similar patterns
of solvent exposure. Highly designable structures are those with unusual patterns of surface 
exposure, and therefore with few competitors. This lack of competitors also implies that the 
sequences folding to highly designable structures are thermally stable. 

Since hydrophobicity is generally accepted to be the dominant force for folding of real 
proteins, the designability principle may provide a guide to understanding the selection 
of natural protein structures. Of course, real proteins are held together by forces other 
than hydrophobicity. Next to hydrophobicity, the formation of hydrogen bonds is the most 
important factor in determining how a typical protein folds. The backbone hydrogen bonds 
of α-helices and β-sheets help stabilize particular folds. These secondary structures can be
incorporated within the framework of designability as a favorable energy bias for formation
of α-helices and β-sheets.

One way to incorporate hydrogen bonding in the design of new protein folds is to specify 
in advance the secondary structure of the protein. This approach has the added advantage 
of greatly reducing the number of degrees of freedom. The desired secondary structures can 
be designed into the sequence via the propensities of particular amino acids to form α-helices
and β-strands.

This approach to design was recently carried out for four-helix bundles [12]. Compact, 
self-avoiding structures consisting of four tethered 15-residue α-helices were generated and
assessed for designability. Figure 9 shows the four most designable distinct folds, which
closely correspond to natural four-helix bundles. As shown in Fig. 10, the histogram of 
designability for the four-helix model has the characteristic long tail of highly designable 
structures. 

The principle of designability has been motivated here in terms of hydrophobic solvation. 
More generally, the dependence of both designability and thermal stability on a competition 
among structures broadens the application of the principle. For example, designability and 
thermal stability have been found to correlate in non-solvation models including
random-interaction models [19] and folding of two-letter RNA [23]. In the future, we hope that
designability will provide a guide to the design of new structures both for polymers other 
than proteins and for solvents other than water. 

We gratefully acknowledge the contributions of many coworkers in developing the notion 
of designability, in particular Eldon Emberly, Robert Helling, Regis Melin, Jonathan Miller, 
Tairan Wang, and Chen Zeng. 

FIGURES




FIG. 1. (a) A 6 x 6 compact structure and its corresponding string. In the "purely hydrophobic" 
(PH) model, only two types of sites are considered, surface sites and core sites. The core is shown 
enclosed by a dotted line. Each structure is represented by a binary string s_i (i = 1, ..., 36) of 0s
and 1s representing surface and core sites, respectively. (b) Histogram of designability N_S for the
6x6 PH model, obtained using 19,492,200 randomly chosen sequences. A Poisson distribution 
with the same mean is shown for comparison. 

FIG. 2. Schematic representation of sequences and structures in the purely hydrophobic (PH)
model. Dots represent sequences, i.e. all binary strings. Dots with circles represent binary strings 
associated with compact structures. Multiple circles indicate degenerate strings, i.e. strings
associated with more than one compact structure. In the PH model, the energy of a sequence folded into
a particular structure is the Hamming distance between their binary strings. Hence the set of
sequences which fold uniquely to a particular structure, whose size is the designability of the
structure, is the set of vertices lying closer to that structure than to any other, as indicated
for one particular structure
by the shaded region. 






FIG. 3. (a) Number of neighboring structures n(d) versus distance d to neighbors for three 
representative 6x6 structures, with low (circles), intermediate (triangles), and high (squares) 
designability. The distance between structures is defined as the Hamming distance between their 
binary strings. (b) Designability versus γ, the variance of n(d), for all 6 x 6 structures.

FIG. 4. (a) Most designable 6x6 structure using 20 amino-acid types. Only noncovalent
nearest-neighbor interactions contribute to the energy, as indicated by dashed lines for a few pairs.
Interaction energies are taken from the Miyazawa-Jernigan (MJ) matrix. (b) Most designable
3x3x3 structure using the same MJ-matrix energies. 

FIG. 5. (a) Histogram of designability N_S for the 6x6 MJ-matrix model. (b) Average gap
versus designability for the 6x6 MJ-matrix model. Data obtained using 9,095,000 randomly chosen 
sequences. 

FIG. 6. (a) Histogram of designability N_S for the 3x3x3 MJ-matrix model. (b) Average gap
versus designability for the 3x3x3 MJ-matrix model. Data obtained using 13,550,000 randomly 
chosen sequences. 

FIG. 7. (a) Example of a compact, self-avoiding 23-mer backbone generated using three
dihedral-angle pairs. (b) Backbone with generic sidechain spheres centered on Cα positions.

FIG. 8. (a) Histogram of designability for 23-mer off-lattice structures of the type shown in
Fig. 7. (b) Average energy gap (black dots) and largest energy gap (red dots) versus designability. 
Data generated by enumeration of all binary sequences. 

FIG. 9. Four most designable four-helix bundles generated by packing tethered 15-residue
α-helices. The helices are numbered at their N-termini.







FIG. 10. Histogram of designability N_S for four-helix bundles. Data obtained using 2,000,000
randomly chosen binary sequences. 